Site-Independent Template-Block Detection

Authors

  • Aleksander Kolcz
  • Wen-tau Yih
Abstract

Detecting template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches proposed, most either assume operation within the confines of a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This limits their applicability, since in many practical scenarios template blocks must be detected in arbitrary web pages, with no prior knowledge of the site structure. In this work we propose to bridge these two approaches by using within-site template discovery techniques to drive the induction of a site-independent template detector. Our approach eliminates the need for human annotation and produces highly effective models. Experimental results demonstrate the usefulness of the proposed methodology for the important application of keyword extraction, with relative performance gains as high as 20%.
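The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the within-site discovery technique, the features, or the learner. The sketch assumes (hypothetically) that a block repeated verbatim on most pages of one site is template, uses word count as the only site-independent feature, and fits a one-threshold decision stump. All function names, the `min_fraction` parameter, and the toy data are illustrative.

```python
from collections import Counter


def weak_labels(pages_by_site, min_fraction=0.8):
    """Label blocks without human annotation (assumed heuristic).

    pages_by_site maps a site name to a list of pages; each page is a
    list of block strings. A block is labeled template (True) when its
    exact text occurs on at least min_fraction of that site's pages.
    """
    examples = []
    for pages in pages_by_site.values():
        seen_on = Counter()
        for blocks in pages:
            seen_on.update(set(blocks))  # count each block once per page
        for blocks in pages:
            for block in blocks:
                is_tpl = seen_on[block] / len(pages) >= min_fraction
                examples.append((block, is_tpl))
    return examples


def train_stump(examples):
    """Pick the word-count threshold with the fewest training errors.

    The learned rule is: a block is template if it has at most
    `threshold` whitespace-separated tokens. The rule uses no
    site-specific information, so it can be applied to any page.
    """
    data = [(len(block.split()), label) for block, label in examples]
    best_t, best_err = 0, len(data) + 1
    for t in sorted({n for n, _ in data}):
        err = sum((n <= t) != label for n, label in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def is_template(block, threshold):
    """Apply the site-independent rule to a block from an unseen site."""
    return len(block.split()) <= threshold


# Toy data (invented for illustration): two sites, two pages each.
toy_sites = {
    "a.example": [
        ["Home | About | Contact",
         "A long article about template detection in web pages"],
        ["Home | About | Contact",
         "Another long article on keyword extraction from pages"],
    ],
    "b.example": [
        ["Login Register",
         "Yet another lengthy story covering many different topics today"],
        ["Login Register",
         "A second distinct story with plenty of words inside"],
    ],
}

threshold = train_stump(weak_labels(toy_sites))
```

Once trained, `is_template(block, threshold)` can be called on blocks from a site that was never seen during training, which is the sense in which the detector is site-independent. A realistic system would replace the single word-count feature with richer block features and the stump with a stronger classifier.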




Publication date: 2007